The stock market has consistently proven to be a good place to invest and save for the future. There are many compelling reasons to invest in stocks: they can help fight inflation, build wealth, and even provide some tax benefits. Steady returns compounded over a long period can grow far more than seems possible, and thanks to the power of compounding, the earlier one starts investing, the larger the corpus one can accumulate for retirement. Overall, investing in stocks can help meet life's financial aspirations.
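The compounding effect mentioned above can be sketched in a couple of lines of Python; the principal, rate, and horizon below are illustrative assumptions, not figures from this project.

```python
# Future value of a one-time investment under annual compounding: P * (1 + r)^n.
# All numbers here are hypothetical, chosen only to illustrate compounding.
def future_value(principal, annual_rate, years):
    return principal * (1 + annual_rate) ** years

# Starting 10 years earlier makes a large difference at an assumed 7% a year.
early_start = future_value(10_000, 0.07, 30)
late_start = future_value(10_000, 0.07, 20)
```

At these assumed figures, the 30-year investor ends up with roughly double the 20-year investor's corpus from the same initial amount.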
It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. A diversified portfolio tends to yield higher returns and carry lower risk by tempering potential losses when the market is down. It is easy to get lost in the sea of financial metrics used to assess a stock's worth, and repeating that analysis across a multitude of stocks to identify the right picks for an individual can be tedious. Cluster analysis can identify stocks that exhibit similar characteristics as well as ones that exhibit minimal correlation. This helps investors analyze stocks across different market segments and protects against risks that could leave the portfolio vulnerable to losses.
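A toy illustration of why low correlation matters, using synthetic returns rather than market data: blending two series that move in opposite directions damps the swings of the combined portfolio.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic daily returns; `b` is constructed to move opposite to `a`
a = rng.normal(0.01, 0.05, 1000)
b = 0.02 - a + rng.normal(0, 0.01, 1000)
portfolio = 0.5 * a + 0.5 * b  # equal-weighted blend

corr_ab = np.corrcoef(a, b)[0, 1]   # strongly negative by construction
damped = portfolio.std() < a.std()  # the blend is far less volatile than either leg
```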
Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock prices and some financial indicators for a few companies listed on the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='darkgrid')
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import cdist, pdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
#Connecting Google drive with Google colab
# Reading the data-set into Google colab
from google.colab import drive
drive.mount('/content/drive')
#Reading the "stock_data.csv" dataset into a dataframe (i.e.loading the data)
path="/content/drive/My Drive/stock_data.csv"
data = pd.read_csv(path)
# returning the first 5 rows using the dataframe head method
data.head()
# returning the last 5 rows using dataframe tail method
data.tail()
#checking shape of the dataframe to find out the number of rows and columns using the dataframe shape command
print("There are", data.shape[0], 'rows and', data.shape[1], "columns.")
# Using the dataframe info() method to print a concise summary of the DataFrame
data.info()
Observation
All of these dtypes are appropriate for their respective columns
Total memory usage is approximately 40.0+ KB.
# copying the data to another variable to avoid any changes to original data
df = data.copy()
# checking the statistical summary of the data using describe command and transposing.
df.describe(include='all').T
Observation
There are 340 observations in all
Differences between mean and median values indicate skewness in the data
There are 11 GICS Sectors, of which most stocks belong to the Industrials sector
There are 104 GICS Sub Industries
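The mean-median gap noted above can be quantified with pandas' `skew()`; the sample below is synthetic, standing in for a right-skewed column such as 'Current Price'.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
s = pd.Series(rng.lognormal(mean=3, sigma=1, size=500))  # right-skewed sample

# for right skew, the mean is pulled above the median and skew() is positive
right_skewed = (s.mean() > s.median()) and (s.skew() > 0)
```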
# Checking for missing values
df.isnull().sum()
Observation: No null values are present, so there are no missing values to treat
df.nunique()
# checking for duplicate values
df.duplicated().sum()
Observation
Questions:
# function to create a boxplot and histogram for a numeric column
def histobox_plot(df, column, figsize=(15, 10), kde=False, bins=None):
    # set a grey background (use sns.set_theme() if seaborn version 0.11.0 or above)
    sns.set(style="darkgrid")
    # creating a figure composed of two matplotlib.Axes objects (ax_box and ax_hist)
    f, (ax_box, ax_hist) = plt.subplots(
        2, sharex=True, gridspec_kw={"height_ratios": (0.15, 0.85)}, figsize=figsize
    )
    # assigning a graph to each ax
    sns.boxplot(data=df, x=column, ax=ax_box, showmeans=True, color="violet")
    if bins:
        sns.histplot(data=df, x=column, ax=ax_hist, kde=kde, bins=bins)
    else:
        sns.histplot(data=df, x=column, ax=ax_hist, kde=kde)
    ax_hist.axvline(df[column].mean(), color="green", linestyle="--")  # add mean to the histogram
    ax_hist.axvline(df[column].median(), color="black", linestyle="-")  # add median to the histogram
    # remove x-axis name for the boxplot
    ax_box.set(xlabel="")
    for p in ax_hist.patches:
        height = p.get_height()  # get the height of each bar
        # adding a count label above each bar
        ax_hist.text(
            x=p.get_x() + p.get_width() / 2,  # x position, centred on the bar
            y=height + 0.2,                   # y position, padded 0.2 above the bar
            s="{:.0f}".format(height),        # label, formatted to ignore decimals
            ha="center",                      # horizontal alignment
        )
    plt.show()
Current Price
histobox_plot(df, 'Current Price')
Observation
Price Change
histobox_plot(df, 'Price Change')
Observation
Volatility
histobox_plot(df, 'Volatility')
Observation
ROE
histobox_plot(df, 'ROE')
Observation
Cash Ratio
histobox_plot(df, 'Cash Ratio')
Observation
Net Cash Flow
histobox_plot(df, 'Net Cash Flow')
Observation
Net Income
histobox_plot(df, 'Net Income')
Observation
Earnings Per Share
histobox_plot(df, 'Earnings Per Share')
Observation
Estimated Shares Outstanding
histobox_plot(df, 'Estimated Shares Outstanding')
Observation
P/E Ratio
histobox_plot(df, 'P/E Ratio')
Observation
P/B Ratio
histobox_plot(df, 'P/B Ratio')
Observation
# function to create labeled barplots
def labeled_barplot(df, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    df: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(df[feature])  # length of the column
    count = df[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=df,
        x=feature,
        palette="Paired",
        order=df[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # x position, centred on the bar
        y = p.get_height()  # y position, at the top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count or percentage

    plt.show()  # show the plot
GICS Sector
labeled_barplot(df, 'GICS Sector', perc=True)
GICS Sub Industry
labeled_barplot(df, 'GICS Sub Industry', perc=True)
# using a heatmap to check correlation between the numeric variables
plt.figure(figsize=(15, 7))
sns.heatmap(
    df.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)  # numeric_only avoids errors on the categorical columns (pandas >= 1.5)
plt.show()
Observation: Some variables exhibit moderate correlations (around ±0.40) with one another.
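Rather than reading the moderate pairs off the heatmap, they can be extracted programmatically; the frame below is synthetic, but the same pattern applied to the project's correlation matrix would list every pair with |r| ≥ 0.40.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "a": x,
    "b": x + rng.normal(scale=0.5, size=200),  # correlated with `a`
    "c": rng.normal(size=200),                 # independent noise
})

corr = demo.corr()
# keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong_pairs = upper.stack().loc[lambda s: s.abs() >= 0.40]
```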
The stocks of which economic sector have seen the maximum price increase on average?
df.groupby('GICS Sector')['Price Change'].mean().sort_values()
plt.figure(figsize=(15,8))
ax=sns.barplot(data=df, x='GICS Sector', y='Price Change', ci=False)
for i in ax.containers:
    ax.bar_label(i,)
plt.xticks(rotation=90)
plt.show()
Observation: In the period covered by the data, stocks in the Health Care sector showed the highest average price increase.
Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
df.groupby('GICS Sector')['Cash Ratio'].mean().sort_values(ascending=False)
plt.figure(figsize=(20,8))
ax=sns.barplot(data=df, x='GICS Sector', y='Cash Ratio', ci=False)
for i in ax.containers:
    ax.bar_label(i,)
plt.show()
Observation
P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. Let's see how the P/E ratio varies, on average, across economic sectors.
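As a reminder of the arithmetic behind the metric: P/E is simply price per share divided by earnings per share, so the dataset's 'P/E Ratio' column should roughly equal 'Current Price' / 'Earnings Per Share'. The numbers below are hypothetical.

```python
price_per_share = 120.0   # hypothetical share price
earnings_per_share = 6.0  # hypothetical EPS
# an investor pays $20 for every $1 of annual earnings
pe_ratio = price_per_share / earnings_per_share
```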
df.groupby('GICS Sector')['P/E Ratio'].mean().sort_values(ascending=False)
plt.figure(figsize=(15,8))
ax=sns.barplot(data=df, x='GICS Sector', y='P/E Ratio', ci=False)
for i in ax.containers:
    ax.bar_label(i,)
plt.xticks(rotation=90)
plt.show()
Observation
Volatility is responsible for the oscillations observed in stock prices. Stocks characterized by high volatility tend to experience more pronounced fluctuations, thereby increasing their investment risk. Now, let's examine the average variations in volatility within different economic sectors.
df.groupby('GICS Sector')['Volatility'].mean().sort_values(ascending=False)
plt.figure(figsize=(15,8))
ax=sns.barplot(data=df, x='GICS Sector', y='Volatility', ci=False)
for i in ax.containers:
    ax.bar_label(i,)
plt.xticks(rotation=90)
plt.show()
Observation
The Earnings Per Share (EPS) metric plays a significant role in influencing stock investments. EPS represents the portion of a company's profit allocated to each outstanding share of its common stock. It serves as a crucial indicator of a company's financial health and performance, impacting investor decisions Now, let's examine the average variations in Earnings Per Share within different economic sectors.
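EPS itself is net income divided by shares outstanding, so the 'Earnings Per Share' column can be cross-checked against 'Net Income' / 'Estimated Shares Outstanding'. Hypothetical figures below.

```python
net_income = 4_000_000.0          # hypothetical annual profit
shares_outstanding = 2_500_000.0  # hypothetical share count
# portion of profit allocated to each outstanding share
eps = net_income / shares_outstanding
```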
df.groupby('GICS Sector')['Earnings Per Share'].mean().sort_values(ascending=False)
plt.figure(figsize=(15,8))
ax=sns.barplot(data=df, x='GICS Sector', y='Earnings Per Share', ci=False)
for i in ax.containers:
    ax.bar_label(i,)
plt.xticks(rotation=90)
plt.show()
Observation
plt.figure(figsize=(15, 12))
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
for i, variable in enumerate(numeric_columns):
    plt.subplot(3, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()
Observation
num_col = ['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio','Net Cash Flow','Net Income','Earnings Per Share','Estimated Shares Outstanding','P/E Ratio','P/B Ratio']
num_col
# Scaling the data set before clustering
scaler = StandardScaler()
subset = df[numeric_columns].copy()
subset_scaled = scaler.fit_transform(subset)
# Creating a dataframe from the scaled data
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
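As a sanity check on the scaling step: every StandardScaler-transformed column should have roughly zero mean and unit standard deviation. Demonstrated on synthetic data of a similar shape below.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(loc=50, scale=12, size=(200, 3))  # stand-in for the numeric columns

X_scaled = StandardScaler().fit_transform(X)
col_means = X_scaled.mean(axis=0)  # ~0 for every column
col_stds = X_scaled.std(axis=0)    # 1 for every column (scaler uses the population std)
```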
#create pairplot for scaled dataframe
sns.pairplot(subset_scaled_df,diag_kind='kde')
subset_scaled_df.head()
k_means_df = subset_scaled_df.copy()
#print average distortions for range of kmeans models fitted to scaled dataset
clusters = range(1, 15)
meanDistortions = []
for k in clusters:
    model = KMeans(n_clusters=k, random_state=1)
    model.fit(subset_scaled_df)
    prediction = model.predict(k_means_df)
    distortion = (
        sum(np.min(cdist(k_means_df, model.cluster_centers_, "euclidean"), axis=1))
        / k_means_df.shape[0]
    )
    meanDistortions.append(distortion)
    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()
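Eyeballing the elbow can be complemented with a crude heuristic: pick the k where the distortion curve bends most sharply, i.e. where the second difference is largest. The distortion values below are made up for illustration.

```python
import numpy as np

# hypothetical average distortions for k = 1..8
distortions = [5.0, 3.9, 3.0, 2.3, 2.15, 2.05, 1.98, 1.93]
ks = list(range(1, len(distortions) + 1))

second_diff = np.diff(distortions, n=2)        # curvature of the curve
elbow_k = ks[int(np.argmax(second_diff)) + 1]  # +1 realigns after differencing
```

On this synthetic curve the heuristic lands on k = 4, where the steep drops give way to marginal gains.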
# fit KMeans model and use a visualizer to indicate the optimal k value
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(1, 15), timings=True)
visualizer.fit(k_means_df)
visualizer.show()
sil_score = []
cluster_list = range(2, 15)
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)
    preds = clusterer.fit_predict(subset_scaled_df)
    score = silhouette_score(k_means_df, preds)
    sil_score.append(score)
    print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))
plt.plot(cluster_list, sil_score)
plt.show()
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2, 15), metric="silhouette", timings=True)
visualizer.fit(k_means_df)
visualizer.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(14, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(13, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(12, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(11, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(10, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(9, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(8, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(7, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(6, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(5, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(3, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(2, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
Observation
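The silhouette sweep above can be condensed into picking the k with the highest score; on synthetic, well-separated blobs the score peaks at the true cluster count.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# three clearly separated synthetic clusters
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=1)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # k with the highest silhouette score
```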
# final K-means model
kmeans = KMeans(n_clusters=6, random_state=1)
kmeans.fit(k_means_df)
# creating a copy of the original data
df1 = df.copy()
# adding kmeans cluster labels to the original and scaled dataframes
k_means_df["KM_segments"] = kmeans.labels_
df1["KM_segments"] = kmeans.labels_
km_cluster_profile = df1.groupby("KM_segments").mean(numeric_only=True)
km_cluster_profile["count_in_each_segment"] = (
    df1.groupby("KM_segments")["Security"].count().values
)
km_cluster_profile.style.highlight_max(color="lightgreen", axis=0)
# printing the companies in each cluster
for cl in df1["KM_segments"].unique():
    print("In cluster {}, the following companies are present:".format(cl))
    print(df1[df1["KM_segments"] == cl]["Security"].unique())
    print()
#print number of stocks within each sector for all of the clusters
df1.groupby(["KM_segments", "GICS Sector"])['Security'].count()
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")
# selecting numerical columns
num_col = df.select_dtypes(include=np.number).columns.tolist()
for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=df1, x="KM_segments", y=variable)
    plt.tight_layout(pad=2.0)
Cluster 0
Cluster 1
Cluster 2
Companies within this cluster have:
Moderate Volatility
Cluster 3
Cluster 4
Cluster 5
hc_df = subset_scaled_df.copy()
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(hc_df, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(hc_df))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*' * 100)
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
Let's explore different linkage methods with Euclidean distance only.
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for lm in linkage_methods:
    Z = linkage(hc_df, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(hc_df))
    print("Cophenetic correlation for {} linkage is {}.".format(lm, c))
    if high_cophenet_corr < c:
        high_cophenet_corr = c
        high_dm_lm[0] = "euclidean"
        high_dm_lm[1] = lm
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*' * 100)
print(
    "Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
        high_cophenet_corr, high_dm_lm[1]
    )
)
Let's view the dendrograms for the different linkage methods with Euclidean distance.
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(hc_df, metric="euclidean", method=method)
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")

    coph_corr, coph_dist = cophenet(Z, pdist(hc_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
    compare.append([method, coph_corr])
Observation
# create and print a dataframe to compare cophenetic correlations for different linkage methods
df_cc = pd.DataFrame(compare, columns=compare_cols)
df_cc = df_cc.sort_values(by="Cophenetic Coefficient")
df_cc
HCmodel = AgglomerativeClustering(n_clusters=5, metric="euclidean", linkage="average")  # metric= replaces the deprecated affinity= (scikit-learn >= 1.2)
HCmodel.fit(hc_df)
# creating a copy of the original data
df2 = df.copy()
# adding hierarchical cluster labels to the original and scaled dataframes
hc_df["HC_segments"] = HCmodel.labels_
df2["HC_segments"] = HCmodel.labels_
hc_cluster_profile = df2.groupby("HC_segments").mean(numeric_only=True)
hc_cluster_profile["count_in_each_segment"] = (
    df2.groupby("HC_segments")["Security"].count().values
)
hc_cluster_profile.style.highlight_max(color="lightgreen", axis=0)
# printing the companies in each cluster
for cl in df2["HC_segments"].unique():
    print("In cluster {}, the following companies are present:".format(cl))
    print(df2[df2["HC_segments"] == cl]["Security"].unique())
    print()
df2.groupby(["HC_segments", "GICS Sector"])['Security'].count()
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")
for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=df2, x="HC_segments", y=variable)
    plt.tight_layout(pad=2.0)
Let us try using Ward linkage, as it produces more distinct and separated clusters (as seen from its dendrogram above). From the Ward-linkage dendrogram, 6 appears to be the appropriate number of clusters.
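An equivalent way to obtain a 6-cluster Ward solution is to cut the linkage matrix directly with scipy's `fcluster` rather than refitting AgglomerativeClustering; sketched on synthetic data below.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# six well-separated synthetic groups of 20 points each
X = np.vstack([rng.normal(loc=i * 10, scale=0.5, size=(20, 2)) for i in range(6)])

Z = linkage(X, metric="euclidean", method="ward")
labels = fcluster(Z, t=6, criterion="maxclust")  # flat cluster labels, numbered 1..6
```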
HCmodel = AgglomerativeClustering(n_clusters=6, metric="euclidean", linkage="ward")  # metric= replaces the deprecated affinity= (scikit-learn >= 1.2)
HCmodel.fit(subset_scaled_df)
# creating a copy of the original data
df2 = df.copy()
# adding hierarchical cluster labels to the original and scaled dataframes
hc_df["HC_segments"] = HCmodel.labels_
df2["HC_segments"] = HCmodel.labels_
hc_cluster_profile = df2.groupby("HC_segments").mean(numeric_only=True)
hc_cluster_profile["count_in_each_segment"] = (
    df2.groupby("HC_segments")["Security"].count().values
)
hc_cluster_profile.style.highlight_max(color="lightgreen", axis=0)
# printing the companies in each cluster
for cl in df2["HC_segments"].unique():
    print("In cluster {}, the following companies are present:".format(cl))
    print(df2[df2["HC_segments"] == cl]["Security"].unique())
    print()
df2.groupby(["HC_segments", "GICS Sector"])['Security'].count()
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")
for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=df2, x="HC_segments", y=variable)
    plt.tight_layout(pad=2.0)
Cluster 0
Cluster 1
7 stocks, drawn mostly from the Consumer Discretionary, Consumer Staples, Energy, Financials, and Industrials sectors
Companies within this cluster have:
Cluster 2
Cluster 3
Cluster 4
Cluster 5
Comparing the two clustering techniques:
Which clustering technique took less time for execution?
How many clusters are obtained as the appropriate number from each algorithm? With both algorithms, 6 clusters produced well-defined groups, each containing enough observations to effectively distinguish the representative "type" of stock within the cluster.
Differences or similarities in the cluster profiles obtained from the two techniques
-
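The execution-time question can be answered empirically by timing both algorithms on data of the same shape as the scaled stock data (340 rows, 11 numeric columns); the data below is synthetic.

```python
import time
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(340, 11))  # same shape as the scaled dataset

t0 = time.perf_counter()
KMeans(n_clusters=6, n_init=10, random_state=1).fit(X)
kmeans_seconds = time.perf_counter() - t0

t0 = time.perf_counter()
AgglomerativeClustering(n_clusters=6).fit(X)  # defaults: ward linkage, euclidean
hc_seconds = time.perf_counter() - t0
```

Wall-clock results depend on the machine and library versions, so the comparison should be rerun in the notebook's own environment.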
Trade&Ahead should initially assess their clients' financial objectives, risk tolerance, and investment patterns before suggesting a cluster that aligns with these requirements as a potential stock portfolio.
Nevertheless, a considerable number of these clusters, based on their constituent stock attributes, essentially serve as alternatives to standard indexes. Achieving these objectives could potentially be more straightforward using such indexes.
Alternatively, Trade&Ahead could employ these clusters as a foundational step for conducting comprehensive financial statement analysis. This approach could specifically highlight stocks that deviate from the cluster "profile."
If the process of handpicking individual stocks is integral to a client's investment approach, Trade&Ahead may then be well-positioned to identify stocks that are likely to outperform their counterparts (resulting in a buy recommendation) or underperform (prompting a sell recommendation).